Add dat bypass, build-cluster-db CLI, and incremental improvements #10
Conversation
Normalizes both legacy msnet and QPX parquet formats to canonical column names at load time, extracts global_qvalue from QPX additional_scores, and adds a standalone cluster2parquet command for the cluster-parquet step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
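Such normalization can be sketched roughly like this; the legacy names, canonical names, and the `additional_scores` layout here are assumptions for illustration, not the actual msnet/QPX schemas:

```python
import pandas as pd

# Hypothetical mapping from legacy column names to canonical ones.
CANONICAL = {"PeptideSequence": "peptidoform", "PrecursorMz": "precursor_mz"}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename legacy columns and pull global_qvalue out of additional_scores."""
    df = df.rename(columns={k: v for k, v in CANONICAL.items() if k in df.columns})
    if "additional_scores" in df.columns:
        # additional_scores assumed here to be a list of
        # {"score_name": ..., "score_value": ...} dicts per row.
        df["global_qvalue"] = df["additional_scores"].map(
            lambda scores: next(
                (s["score_value"] for s in (scores or [])
                 if s["score_name"] == "global_qvalue"),
                None,
            )
        )
    return df

df = pd.DataFrame({
    "PeptideSequence": ["PEPTIDE"],
    "additional_scores": [[{"score_name": "global_qvalue", "score_value": 0.01}]],
})
out = normalize(df)
```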
BestSpectrumStrategy now uses groupby.idxmin() instead of groupby.apply(top_n_rows) for selecting the best spectrum per cluster (~27x measured speedup). Base class uses sort_values + groupby.head() for top-N selection in large clusters (~76x measured speedup). Also replaces np.isin with .isin() and np.vstack with pd.concat for cleaner DataFrame operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
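The two selection patterns described here can be sketched as follows (column names are illustrative, not the project's actual schema):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster_id": ["A", "A", "B", "B", "B"],
    "posterior_error_probability": [0.3, 0.1, 0.5, 0.2, 0.4],
    "usi": ["u1", "u2", "u3", "u4", "u5"],
})

# Slow path: a per-group apply that builds a sub-frame for every cluster.
slow = df.groupby("cluster_id", group_keys=False).apply(
    lambda g: g.nsmallest(1, "posterior_error_probability")
)

# Fast path: one vectorized idxmin per group, then a single .loc lookup.
best_idx = df.groupby("cluster_id")["posterior_error_probability"].idxmin()
fast = df.loc[best_idx]

# Top-N per group without apply: sort once, then take the head of each group.
top2 = df.sort_values("posterior_error_probability").groupby("cluster_id").head(2)
```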
Previously only the first parquet file was processed when a list was passed. Now iterates over all files, accumulating cluster results across the full dataset. State (write counts, file indices) persists across files so MGF path/order mapping remains consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements three-phase incremental clustering for adding new projects
without re-clustering the entire dataset:
1. extract-reps: Writes one representative spectrum per cluster from
cluster_metadata.parquet to MGF (TITLE=rep:{cluster_id} for tracing)
2. resolve-clusters: After MaRaCluster runs on reps + new data, maps
new cluster IDs back to existing ones via representative membership.
Handles cluster merges (multiple old clusters → one new).
3. merge-results: Appends new PSMs to membership table (dedup by USI),
rebuilds cluster metadata with updated stats and consensus spectra.
CLI: pyspectrafuse incremental extract-reps / merge-clusters
Tests cover extraction, index building, resolution, and merge scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
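Phase 2's ID resolution reduces to a small mapping exercise; a rough sketch under assumed data shapes (the helper name, tuple layout, and merge tie-break are invented for illustration):

```python
from collections import defaultdict

def resolve_clusters(rep_titles, assignments):
    """Map MaRaCluster's new cluster labels back to existing cluster ids.

    rep_titles: TITLE strings from the representative MGF, e.g. "rep:OLD_1".
    assignments: (spectrum_key, new_label) pairs; spectrum_key is assumed to be
                 the rep's MGF order (int) for representatives, or a USI-like
                 string for genuinely new spectra.
    """
    # Collect which old clusters contributed a representative to each new label.
    members = defaultdict(set)
    for key, new_label in assignments:
        if isinstance(key, int):  # a representative spectrum
            old_id = rep_titles[key].split("rep:", 1)[1]
            members[new_label].add(old_id)

    id_map = {}
    for new_label, old_ids in members.items():
        # A merge (several old clusters landing in one new cluster) resolves to
        # the lexicographically smallest old id here; the real rule may differ.
        id_map[new_label] = min(old_ids)
    return id_map

# OLD_1 and OLD_2 both end up in NEW_A, i.e. a cluster merge.
rep_titles = ["rep:OLD_1", "rep:OLD_2"]
assignments = [(0, "NEW_A"), (1, "NEW_A"), ("mzspec:PXD1:run:scan:7", "NEW_A")]
id_map = resolve_clusters(rep_titles, assignments)
```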
- Add parquet-to-dat conversion (maracluster_dat.py) bypassing MGF entirely (15x smaller)
- Add build-cluster-db command for dat-bypass workflow
- Add parquet2dat CLI command
- Expand incremental merge-clusters with BIN strategy support
- Update cluster2parquet for QPX metadata compatibility
- Register new CLI commands in pyspectrafuse.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

This PR adds a cluster database pipeline, incremental clustering tools, QPX metadata ingestion and schema normalization, new CLI commands (build-cluster-db, cluster2parquet, convert-to-qpx, convert-dat, incremental subcommands), improved consensus strategies, DAT conversion support, and related tests and packaging/docs updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as build-cluster-db CMD
    participant TSV as Cluster TSV Parser
    participant Titles as ScanTitles Parser
    participant Parquet as Parquet Scanner
    participant Schema as ParquetSchemaAdapter
    participant PSM as PSM Membership Builder
    participant Consensus as Consensus Builder
    participant Output as Parquet Writer
    User->>CLI: invoke with cluster_tsv, scan_titles, parquet_dirs
    CLI->>TSV: parse_cluster_tsv()
    TSV-->>CLI: cluster assignments (mgf_basename, scannr, cluster_id)
    CLI->>Titles: parse_scan_titles()
    Titles-->>CLI: scan → run_file, orig_scan, peptidoform, charge
    CLI->>Parquet: scan parquet_dirs for scalar fields
    Parquet->>Schema: available_columns() + adapt()
    Schema-->>Parquet: normalized rows
    Parquet-->>CLI: scalar values (PEP, q-value, mz)
    CLI->>PSM: merge cluster assignments + scan info + scalars
    PSM->>PSM: compute per-cluster stats (count, best PEP/q, purity)
    PSM->>Output: write psm_cluster_membership.parquet
    CLI->>Parquet: stream mz/intensity arrays per cluster
    Parquet->>Consensus: build consensus spectra
    Consensus-->>PSM: consensus arrays
    PSM->>Output: write cluster_metadata.parquet
    Output-->>User: return output paths
```
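The per-cluster stats step shown in the diagram can be sketched as a plain pandas aggregation (column names assumed):

```python
import pandas as pd

membership = pd.DataFrame({
    "cluster_id": ["c1", "c1", "c1", "c2"],
    "peptidoform": ["PEPTIDE", "PEPTIDE", "OTHER", "PEPTIDE"],
    "posterior_error_probability": [0.01, 0.2, 0.3, 0.05],
    "global_qvalue": [0.001, 0.01, 0.02, 0.005],
})

def purity(peptidoforms: pd.Series) -> float:
    # Fraction of PSMs agreeing with the most common peptidoform in the cluster.
    return peptidoforms.value_counts(normalize=True).iloc[0]

stats = membership.groupby("cluster_id").agg(
    psm_count=("peptidoform", "size"),
    best_pep=("posterior_error_probability", "min"),
    best_qvalue=("global_qvalue", "min"),
    purity=("peptidoform", purity),
)
```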
```mermaid
sequenceDiagram
    actor User
    participant CLI as merge-clusters CMD
    participant Resolver as resolve_incremental_clusters()
    participant RepMGF as Representative MGF parser
    participant IDMap as ID Mapper
    participant ClusterRes as ClusterResHandler
    participant Combiner as CombineCluster2Parquet
    participant Merger as append_psm_membership()
    participant Rebuilder as rebuild_cluster_metadata()
    participant Output as Parquet Writer
    User->>CLI: invoke with cluster_tsv, rep_mgf, new_parquet_dir
    CLI->>Resolver: resolve_incremental_clusters()
    Resolver->>RepMGF: build_representative_index()
    RepMGF-->>Resolver: mapping rep order → old_cluster_id
    Resolver->>IDMap: correlate new_cluster_id → old_cluster_id
    Resolver-->>CLI: id_map + new_spectra_df
    CLI->>ClusterRes: get_cluster_dict(new_cluster_tsv)
    ClusterRes-->>CLI: cluster_accession → mgf_path lookup
    CLI->>Combiner: inject_cluster_info(new_parquets, sample_info_dict)
    Combiner-->>CLI: enriched_df with cluster_accession
    CLI->>Merger: append_psm_membership(existing_membership, new_df)
    Merger-->>CLI: merged membership parquet
    CLI->>Rebuilder: rebuild_cluster_metadata(merged_membership,...)
    Rebuilder-->>Output: updated cluster_metadata.parquet
    CLI-->>User: summary + output paths
```
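The dedup-by-USI append in the merge flow is essentially a concat plus drop_duplicates; a sketch with illustrative USIs:

```python
import pandas as pd

existing = pd.DataFrame({
    "usi": ["mzspec:PXD1:r1:scan:1", "mzspec:PXD1:r1:scan:2"],
    "cluster_id": ["OLD_1", "OLD_2"],
})
incoming = pd.DataFrame({
    "usi": ["mzspec:PXD1:r1:scan:2", "mzspec:PXD2:r1:scan:9"],
    "cluster_id": ["OLD_2", "OLD_1"],
})

# Keep the first (existing) row on USI collisions so re-runs are idempotent.
merged = (
    pd.concat([existing, incoming], ignore_index=True)
    .drop_duplicates(subset="usi", keep="first")
    .reset_index(drop=True)
)
```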
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- Lead with convert-dat and build-cluster-db as recommended commands
- Add incremental clustering and QPX metadata to feature list
- Update project structure with new modules
- Mark MGF conversion and SDRF handling as legacy
- Update version to 0.0.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 7
🧹 Nitpick comments (2)
pyspectrafuse/commands/parquet2dat.py (1)

35-39: Move parquet discovery out of the spectrum2msp command module.

This command now depends on another CLI command for a generic filesystem helper. That makes command imports bidirectional over time and duplicates logic that already belongs in pyspectrafuse.common.parquet_utils. Pull the helper into common and reuse it from both commands.

pyspectrafuse/commands/spectrum2msp.py (1)

40-53: Return parquet files in a stable order.

glob.glob() does not guarantee a reproducible traversal order. Sorting here keeps downstream behavior deterministic: convert-dat increments file_idx per discovered file, and this command also uses the first entry for output naming.

Proposed fix

```diff
- for path_str in glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True):
+ for path_str in sorted(glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True)):
```
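The determinism point is easy to verify in isolation (temporary files, illustrative names):

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    for name in ("b.parquet", "a.parquet", "sub/c.parquet"):
        p = Path(d) / name
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()

    # sorted() pins the traversal order regardless of how the
    # filesystem happens to enumerate directory entries.
    found = sorted(glob.glob(str(Path(d) / "**/*.parquet"), recursive=True))
    names = [Path(p).name for p in found]
```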
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyspectrafuse/commands/build_cluster_db.py`:
- Around line 272-275: The groupby().idxmin() call can fail when a cluster's
posterior_error_probability is all NaN; before computing best_idx, coerce
posterior_error_probability to numeric on the enriched DataFrame and replace
NaNs with np.inf (e.g., set enriched['_pep'] =
pd.to_numeric(enriched['posterior_error_probability'],
errors='coerce').fillna(np.inf) or call enriched['_pep'].fillna(np.inf,
inplace=True)) so that enriched.groupby('cluster_id')['_pep'].idxmin() returns
valid positions; then compute best_idx and select best_rows as before (symbols:
enriched, '_pep', posterior_error_probability, idxmin, best_idx, best_rows).
In `@pyspectrafuse/commands/quantmsio2mgf.py`:
- Around line 46-53: The worker pool is being created with
Pool(processes=os.cpu_count()) which ignores the --task_parallel option; change
the Pool creation to use the task_parallel value (i.e.
Pool(processes=task_parallel)) so that Parquet2Mgf.convert_to_mgf_task actually
runs with the requested number of worker processes, and keep the existing
chunksize parameter for batching (or compute a separate chunksize variable) when
calling pool.imap to control task batching rather than concurrency.
In `@pyspectrafuse/commands/spectrum2msp.py`:
- Line 198: The code currently calls ParquetPathHandler(...).get_item_info()
with the first parquet file path (path_parquet_lst[0]), which extracts the
accession from the last path segment and therefore yields the parquet filename
instead of the project basename; change the call to pass the project directory
(e.g., the parent directory of the parquet files or the existing project_dir
variable) into ParquetPathHandler so get_item_info() reads the directory name as
the accession; update the reference at the call site (where basename is set) to
construct the handler with the folder containing the parquet files rather than
the file path itself.
In `@pyspectrafuse/common/parquet_utils.py`:
- Around line 58-68: The loop building parquet_path_lst currently appends any
glob hit (parquet_file) without checking whether it's a file, so partitioned
datasets stored as directories ending in .parquet get returned; update the loop
in this helper to skip directories by calling parquet_file.is_file() (or
equivalently ensure pathlib.Path(parquet_file).is_file()) before applying the
existing suffix/name/parts filters and appending to parquet_path_lst, leaving
all other filtering (checks against _metadata_suffixes, _exclude_names,
_exclude_dirs) unchanged.
In `@pyspectrafuse/incremental/resolve_clusters.py`:
- Around line 99-103: The current filtering uses rep_mgf_basename and
str.contains, which yields false positives; instead compute exact basenames
using pathlib.Path and do an equality match: derive rep_mgf_basename from
Path(rep_mgf_path).name, compare against cluster_df['mgf_path'].map(lambda p:
Path(p).name) (or add a temporary column of basenames) to build rep_mask, then
set rep_rows/new_rows accordingly; ensure pathlib.Path is imported if not
already.
In `@pyspectrafuse/mgf_convert/parquet2mgf.py`:
- Around line 155-163: _get_mgf_path currently always pulls instrument from
sample_info_dict, ignoring the skip_instrument option; update _get_mgf_path to
check the skip_instrument flag (the variable/attribute used by Parquet2Mgf for
that option) and, when True, use a single shared instrument bucket (e.g., a
fixed name like "all_instruments" or an empty instrument segment) instead of
info[0]; modify the function where mgf_group_df['mgf_file_path'] is set so
_get_mgf_path reads skip_instrument and constructs the path accordingly
(reference function name _get_mgf_path, the sample_info_dict lookup, and
Parquet2Mgf.skip_instrument or the local skip_instrument variable).
In `@tests/test_incremental.py`:
- Around line 150-155: The fixture string assignments for the variable
tsv_content use unnecessary f-string prefixes (e.g., f"reps.mgf\t0\tNEW_A\n"
blocks) which trigger Ruff F541; remove the leading f from each quoted literal
(including the second occurrence around lines 193-196) so they become plain
string literals while preserving the content and escape sequences.
---
Nitpick comments:
In `@pyspectrafuse/commands/parquet2dat.py`:
- Around line 35-39: The parquet discovery helper find_target_ext_files is being
imported from pyspectrafuse.commands.spectrum2msp causing a cross-command
dependency; move or reuse the generic helper from
pyspectrafuse.common.parquet_utils instead: remove the import of
find_target_ext_files from pyspectrafuse.commands.spectrum2msp in parquet2dat.py
and import the equivalent function (or add a new helper) from
pyspectrafuse.common.parquet_utils, then call that helper to produce
parquet_files and delete any duplicated discovery logic so both spectrum2msp and
parquet2dat rely on the single common helper (update exports in parquet_utils if
needed).
In `@pyspectrafuse/commands/spectrum2msp.py`:
- Around line 40-53: glob.glob() can return files in nondeterministic order;
make the discovery deterministic by sorting the found paths. Change the loop
over glob.glob(...) to iterate over sorted(glob.glob(...)) or sort the collected
list `res` (e.g., call `res.sort()` before returning) so that the order of
`path_str` items and the resulting `res` list is stable for downstream
`file_idx` increments and output naming; update code around the for-loop that
builds `res` (referencing `path_str`, `file`, and `res`) to ensure the sorted
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b66a7c3-de12-4302-8ff6-4d095e94000f
📒 Files selected for processing (27)
- pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
- pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
- pyspectrafuse/commands/build_cluster_db.py
- pyspectrafuse/commands/cluster2parquet.py
- pyspectrafuse/commands/convert_msnet_to_qpx.py
- pyspectrafuse/commands/incremental.py
- pyspectrafuse/commands/parquet2dat.py
- pyspectrafuse/commands/quantmsio2mgf.py
- pyspectrafuse/commands/spectrum2msp.py
- pyspectrafuse/common/constant.py
- pyspectrafuse/common/msp_utils.py
- pyspectrafuse/common/parquet_utils.py
- pyspectrafuse/common/qpx_metadata.py
- pyspectrafuse/common/sdrf_utils.py
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
- pyspectrafuse/consensus_strategy/best_spetrum_strategy.py
- pyspectrafuse/consensus_strategy/binning_strategy.py
- pyspectrafuse/consensus_strategy/consensus_strategy_base.py
- pyspectrafuse/consensus_strategy/most_similar_strategy.py
- pyspectrafuse/incremental/__init__.py
- pyspectrafuse/incremental/merge_results.py
- pyspectrafuse/incremental/representative_mgf.py
- pyspectrafuse/incremental/resolve_clusters.py
- pyspectrafuse/maracluster_dat.py
- pyspectrafuse/mgf_convert/parquet2mgf.py
- pyspectrafuse/pyspectrafuse.py
- tests/test_incremental.py
👮 Files not reviewed due to content moderation or server errors (10)
- pyspectrafuse/common/msp_utils.py
- pyspectrafuse/common/constant.py
- pyspectrafuse/common/qpx_metadata.py
- pyspectrafuse/consensus_strategy/binning_strategy.py
- pyspectrafuse/pyspectrafuse.py
- pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
- pyspectrafuse/common/sdrf_utils.py
- pyspectrafuse/commands/convert_msnet_to_qpx.py
- pyspectrafuse/commands/incremental.py
```python
enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()
best_rows = enriched.loc[best_idx, ['cluster_id', 'run_file_name', 'orig_scan',
                                    'peptidoform', 'precursor_mz']].copy()
```
Guard idxmin() against all-NaN PEP groups.
posterior_error_probability is populated through a left join, so a cluster can end up with _pep entirely missing. In that case groupby().idxmin() produces invalid indexers and enriched.loc[best_idx, ...] will fail. Filling _pep with np.inf before the groupby is enough here.
Proposed fix

```diff
- enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
+ enriched['_pep'] = pd.to_numeric(
+     enriched['posterior_error_probability'], errors='coerce'
+ ).fillna(np.inf)
  best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()
```
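The failure mode and the fillna(np.inf) guard are easy to reproduce in isolation:

```python
import numpy as np
import pandas as pd

enriched = pd.DataFrame({
    "cluster_id": ["c1", "c1", "c2"],
    # c2 ends up all-NaN, as it can after a left join.
    "posterior_error_probability": [0.2, 0.1, None],
})

# Guard: coerce to numeric and replace NaN with +inf before taking idxmin,
# so every group yields a valid row index for the .loc lookup.
enriched["_pep"] = pd.to_numeric(
    enriched["posterior_error_probability"], errors="coerce"
).fillna(np.inf)
best_idx = enriched.groupby("cluster_id")["_pep"].idxmin()
best_rows = enriched.loc[best_idx]
```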
```diff
  pool = Pool(processes=os.cpu_count())

  # Pack all required parameters into a list of tuples
- tasks = [(parquet_file_path, sdrf_file_path, res_file_path, batch_size, spectra_capacity) for parquet_file_path
-          in
-          parquet_file_path_lst]
+ tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
+          for parquet_file_path in parquet_file_path_lst]

  # Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
  for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate the file
find . -type f -name "quantmsio2mgf.py" 2>/dev/null
```

Repository: bigbio/pyspectrafuse
Length of output: 105

🏁 Script executed:

```shell
# View the file context around lines 46-53
cat -n pyspectrafuse/commands/quantmsio2mgf.py | sed -n '30,70p'
```

Repository: bigbio/pyspectrafuse
Length of output: 1624

🏁 Script executed:

```shell
# Search for task_parallel definition and usage in the file
rg "task_parallel" pyspectrafuse/commands/quantmsio2mgf.py -B 2 -A 2
```

Repository: bigbio/pyspectrafuse
Length of output: 1256

🏁 Script executed:

```shell
# Check the function signature to understand parameters
ast-grep --pattern 'def $_($$) {
  $$$
}'
```

Repository: bigbio/pyspectrafuse
Length of output: 46
task_parallel is not limiting worker concurrency.
Line 46 creates os.cpu_count() worker processes, while line 53 passes task_parallel as chunksize, which only controls task batching, not the number of workers. So --task_parallel 1 still spawns the full CPU count in parallel. Wire the option into Pool(processes=...) to actually limit worker count, and use chunksize separately for task batching.
Proposed fix

```diff
- pool = Pool(processes=os.cpu_count())
-
- # Pack all required parameters into a list of tuples
- tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
-          for parquet_file_path in parquet_file_path_lst]
-
- # Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
- for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):
-     pass
-
- # Close the process pool and wait for all processes to complete
- pool.close()
- pool.join()
+ worker_count = min(task_parallel, os.cpu_count() or 1)
+
+ # Pack all required parameters into a list of tuples
+ tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
+          for parquet_file_path in parquet_file_path_lst]
+
+ with Pool(processes=worker_count) as pool:
+     for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=1):
+         pass
```
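The distinction between processes and chunksize is easy to demonstrate with the API itself; this sketch uses the thread-backed multiprocessing.dummy.Pool, which shares the Pool interface, so it runs anywhere without forking:

```python
from multiprocessing.dummy import Pool  # thread-based, same API as multiprocessing.Pool

def work(x):
    return x * x

tasks = list(range(8))

# processes bounds how many workers run concurrently;
# chunksize only controls how many tasks are handed to a worker at once.
with Pool(processes=2) as pool:
    results = list(pool.imap(work, tasks, chunksize=4))
```

imap preserves input order, so results line up with tasks regardless of which worker ran each chunk.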
```diff
  output_dir = Path(parquet_dir) / 'msp' / species / instrument / charge
  output_dir.mkdir(parents=True, exist_ok=True)
- basename = ParquetPathHandler(parquet_dir).get_item_info()
+ basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()
```
Use the project directory for get_item_info(), not the first parquet filename.
Line 198 passes a parquet file path into get_item_info(), but that helper extracts the accession from the last path segment. For typical inputs the last segment is just something.parquet, so this now raises ValueError or derives the wrong basename.
Proposed fix

```diff
- basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()
+ basename = ParquetPathHandler(parquet_dir).get_item_info()
```
```python
# Use glob.glob (follows symlinks) instead of Path.rglob (does not)
for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
    parquet_file = Path(path_str)
    name = str(parquet_file)
    if any(name.endswith(s) for s in _metadata_suffixes):
        continue
    if any(ex in parquet_file.stem for ex in _exclude_names):
        continue
    if any(part in _exclude_dirs for part in parquet_file.parts):
        continue
    parquet_path_lst.append(parquet_file)
```
Filter out .parquet directories before returning results.
Line 60 turns every glob hit into a candidate path, but this helper never checks is_file(). Partitioned parquet datasets stored as directories named *.parquet will now be returned here and later consumers will treat them as files.
Proposed fix

```diff
  # Use glob.glob (follows symlinks) instead of Path.rglob (does not)
  for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
      parquet_file = Path(path_str)
+     if not parquet_file.is_file():
+         continue
      name = str(parquet_file)
      if any(name.endswith(s) for s in _metadata_suffixes):
          continue
```
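A small self-contained demo of why the is_file() check matters (temporary paths, illustrative names):

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "plain.parquet").touch()            # a real parquet file
    (Path(d) / "partitioned.parquet").mkdir()      # a partitioned dataset *directory*
    (Path(d) / "partitioned.parquet" / "part-0.parquet").touch()

    # glob matches the directory too, because only the name is tested.
    hits = [Path(p) for p in glob.glob(str(Path(d) / "**/*.parquet"), recursive=True)]
    files_only = sorted(p.name for p in hits if p.is_file())
```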
```python
rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path

rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
rep_rows = cluster_df[rep_mask].copy()
new_rows = cluster_df[~rep_mask].copy()
```
Use exact basename matching for representative rows.
str.contains(rep_mgf_basename, regex=False) will also match unrelated paths like foo_reps.mgf.bak, so those spectra get filtered out of new_rows and can inherit the wrong resolved_cluster_id. Using Path(...).name == rep_mgf_basename avoids both the false positives and the current /-only path split.
Proposed fix

```diff
+from pathlib import Path
+
- rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path
-
- rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
+ rep_mgf_basename = Path(rep_mgf_path).name
+ rep_mask = cluster_df['mgf_path'].map(lambda p: Path(str(p)).name == rep_mgf_basename)
```
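The false-positive risk is concrete; a minimal demo of substring vs exact basename matching:

```python
from pathlib import Path

import pandas as pd

cluster_df = pd.DataFrame({
    "mgf_path": ["/data/reps.mgf", "/data/foo_reps.mgf.bak", "/data/new_data.mgf"],
})
rep_mgf_basename = Path("/work/reps.mgf").name

# Substring match: also catches the unrelated .bak file.
contains_mask = cluster_df["mgf_path"].str.contains(rep_mgf_basename, regex=False)
# Exact basename equality: only the true representative file matches.
exact_mask = cluster_df["mgf_path"].map(lambda p: Path(p).name == rep_mgf_basename)
```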
```diff
  def _get_mgf_path(row):
      fname = Parquet2Mgf.get_filename_from_usi(row)
      info = sample_info_dict.get(fname)
      if info is None:
          logger.warning(f"No metadata for run file: {fname}")
          info = ['Unknown', 'Unknown']
      return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])

- mgf_group_df['mgf_file_path'] = row_group.apply(
-     lambda row: '/'.join(sample_info_dict.get(Parquet2Mgf.get_filename_from_usi(row)) +
-                          ['charge' + str(row["charge"]), 'mgf files']), axis=1)
+ mgf_group_df['mgf_file_path'] = row_group.apply(_get_mgf_path, axis=1)
```
skip_instrument is currently a no-op.
Lines 94-96 advertise a mode that should collapse output under a shared instrument bucket, but _get_mgf_path() always uses the instrument from sample_info_dict. As written, --skip_instrument still writes into per-instrument folders.
Proposed fix

```diff
  def _get_mgf_path(row):
      fname = Parquet2Mgf.get_filename_from_usi(row)
      info = sample_info_dict.get(fname)
      if info is None:
          logger.warning(f"No metadata for run file: {fname}")
          info = ['Unknown', 'Unknown']
+     if skip_instrument:
+         info = [info[0], 'all_instruments']
      return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])
```

🧰 Tools
return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])🧰 Tools
🪛 Ruff (0.15.9)
[warning] 161-161: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
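The intended behavior can be sketched as a pure path-building function; the all_instruments bucket name comes from the suggestion and is not a confirmed project convention:

```python
def build_mgf_path(species: str, instrument: str, charge: int, skip_instrument: bool) -> str:
    # With skip_instrument, collapse every run into one shared instrument bucket.
    bucket = "all_instruments" if skip_instrument else instrument
    return "/".join([species, bucket, f"charge{charge}", "mgf files"])

per_instrument = build_mgf_path("human", "QExactive", 2, skip_instrument=False)
shared = build_mgf_path("human", "QExactive", 2, skip_instrument=True)
```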
```python
tsv_content = (
    f"reps.mgf\t0\tNEW_A\n"
    f"reps.mgf\t1\tNEW_B\n"
    f"new_data.mgf\t0\tNEW_A\n"
    f"new_data.mgf\t1\tNEW_C\n"
)
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
```shell
# First, locate and read the test_incremental.py file
find . -name "test_incremental.py" -type f
```
Repository: bigbio/pyspectrafuse
Length of output: 91
🏁 Script executed:
```shell
# Read the specific lines mentioned in the review
wc -l tests/test_incremental.py
```
Repository: bigbio/pyspectrafuse
Length of output: 93
🏁 Script executed:
```shell
# Read lines 150-155 and 193-196 from the test file
sed -n '150,155p; 193,196p' tests/test_incremental.py
```
Repository: bigbio/pyspectrafuse
Length of output: 357
🏁 Script executed:
```shell
# Double-check by searching for any interpolation patterns in these strings
sed -n '150,155p; 193,196p' tests/test_incremental.py | grep -E '\{.*\}' || echo "No placeholders found"
```
Repository: bigbio/pyspectrafuse
Length of output: 85
Remove the unused f prefixes in these fixture strings.
These f-string literals contain no placeholders, so Ruff flags them with F541 and the test module will fail lint in its current form.
Proposed fix
- f"reps.mgf\t0\tNEW_A\n"
- f"reps.mgf\t1\tNEW_B\n"
- f"new_data.mgf\t0\tNEW_A\n"
- f"new_data.mgf\t1\tNEW_C\n"
+ "reps.mgf\t0\tNEW_A\n"
+ "reps.mgf\t1\tNEW_B\n"
+ "new_data.mgf\t0\tNEW_A\n"
+ "new_data.mgf\t1\tNEW_C\n"- f"reps.mgf\t0\tMERGED\n"
- f"reps.mgf\t1\tMERGED\n"
+ "reps.mgf\t0\tMERGED\n"
+ "reps.mgf\t1\tMERGED\n"Also applies to: 193-196
🧰 Tools
🪛 Ruff (0.15.9)
[error] 151-151: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 152-152: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 153-153: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 154-154: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_incremental.py` around lines 150 - 155, The fixture string
assignments for the variable tsv_content use unnecessary f-string prefixes
(e.g., f"reps.mgf\t0\tNEW_A\n" blocks) which trigger Ruff F541; remove the
leading f from each quoted literal (including the second occurrence around lines
193-196) so they become plain string literals while preserving the content and
escape sequences.
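For context, the fixture above mimics a MaRaCluster cluster TSV (file, scan index, cluster label). A minimal, hypothetical sketch of the resolve-clusters idea — not the project's actual implementation — maps new cluster IDs back to old ones through the representative spectra:

```python
def resolve_clusters(tsv_rows, rep_cluster_ids, reps_file='reps.mgf'):
    """Map new cluster IDs to the old clusters whose representatives they contain.

    tsv_rows:        iterable of 'file\\tscan\\tnew_cluster' rows
    rep_cluster_ids: old cluster ID per representative, indexed by scan number
    """
    mapping = {}
    for row in tsv_rows:
        fname, scan, new_cluster = row.rstrip('\n').split('\t')
        if fname == reps_file:
            # a new cluster holding several representatives is a merge
            mapping.setdefault(new_cluster, []).append(rep_cluster_ids[int(scan)])
    return mapping
```

With the fixture above, NEW_A resolves to the first old cluster, NEW_B to the second, and NEW_C (containing no representative) stays a genuinely new cluster; the MERGED fixture would yield one new cluster mapped to both old IDs.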
Parquet-to-dat is the standard conversion path, not a bypass of anything. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix test_get_metadata_dict_from_sdrf: column name case mismatch (Characteristics[organism] -> characteristics[organism])
- Switch from setuptools to hatchling build backend
- Bump version to 0.0.4, require Python >=3.10
- Update Dockerfile to use uv for faster installs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace per-row iterrows() with columnar array access
- Buffer dat/scaninfo/title writes per batch instead of per spectrum
- Pre-extract columns (charges, mz, runs, etc.) outside inner loop
- Same output, 3-5x faster on large datasets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Around line 39-44: The Docker build fails because pyproject.toml declares
readme = "README.md" but README.md is not copied before running RUN uv pip
install --system --no-cache . Update the Dockerfile so that README.md is copied
into the image before the install step (i.e., place a COPY README.md . between
the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines) so the package
installation can read the declared README during build.
- Line 34: Replace the non-reproducible image reference used as a build stage
source in the COPY statement (currently "ghcr.io/astral-sh/uv:latest" in the
COPY --from=... line) with an explicit version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.
In `@pyspectrafuse/maracluster_dat.py`:
- Around line 147-149: The code defines a columns list including 'rt' but still
writes Spectrum records with retentionTime=0.0; update the Spectrum/dict
construction to use the parsed rt value instead of the hard-coded 0.0 (e.g.,
retentionTime=rt or retentionTime=row['rt']), and make the same change at the
other spot flagged (the second Spectrum/dict creation). Ensure the 'rt' value is
cast to float if necessary before assigning to retentionTime so the parquet
persists the actual retention time.
- Around line 196-218: The code converts mz_array and intensity_array to numpy
arrays then calls _bin_spectrum but doesn't verify they have the same length;
when they differ _bin_spectrum may index out-of-bounds. Before computing
charge/prec_mz/etc., add a validation step comparing lengths of mz_arr and
int_arr (and that both lengths > 0) and skip the spectrum (increment skipped) if
they differ or are zero-length; reference the variables mz_arr, int_arr and the
called function _bin_spectrum when adding this guard so only well-formed arrays
reach _bin_spectrum.
- Around line 264-279: The converter is currently grabbing any ".parquet" files
(via find_target_ext_files) and produces .dat outputs for non-spectrum files;
restrict it to spectrum inputs by changing the finder/filter to only match
".psm.parquet" (e.g., call find_target_ext_files(parquet_dir, '.psm.parquet') or
post-filter parquet_files to keep only names ending with '.psm.parquet'), and
keep the FileNotFoundError when the filtered list is empty; reference the
existing parquet_files variable and the find_target_ext_files call so the filter
change is localized and prevents producing .dat for cluster metadata or
membership files.
In `@README.md`:
- Line 162: The fenced code block at the README's project tree is missing a
language identifier (triggering markdownlint MD040); update the opening fence
from ``` to ```text (or another appropriate language like ```bash) so the block
becomes labeled (e.g., change the block around the pyspectrafuse tree to start
with ```text) and leave the closing ``` as-is to satisfy the linter and improve
rendering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9eb047c6-ab59-4739-9146-a66082074b31
📒 Files selected for processing (5)
- Dockerfile
- README.md
- pyproject.toml
- pyspectrafuse/maracluster_dat.py
- tests/conftest.py
```diff
     && rm -rf /var/lib/apt/lists/*

 ## Install uv
 COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
```
🧩 Analysis chain
🌐 Web query:
What is the recommended secure Docker pattern for ghcr.io/astral-sh/uv (pinning by explicit version and sha256 digest)?
💡 Result:
The recommended secure Docker pattern for ghcr.io/astral-sh/uv is to pin to an explicit version tag and further pin by SHA256 digest for reproducibility and security against tag movement. Official uv documentation provides this example in a multi-stage Dockerfile: FROM python:3.12-slim-trixie COPY --from=ghcr.io/astral-sh/uv:0.11.6 /uv /uvx /bin/ For maximum security, replace the tag with @sha256:: COPY --from=ghcr.io/astral-sh/uv@sha256:2381d6aa60c326b71fd40023f921a0a3b8f91b14d5db6b90402e65a635053709 /uv /uvx /bin/ To obtain the current digest for a specific tag (e.g., :0.4.17), run: docker manifest inspect ghcr.io/astral-sh/uv:0.4.17 (look for "digest" under the platform config). Available tags include :latest, :{version} (e.g., 0.11.6), and variants like :debian, :python3.12-trixie. View all at https://ghcr.io/astral-sh/uv. Images are signed with GitHub attestations; verify with gh attestation verify oci://ghcr.io/astral-sh/uv:0.11.6. This ensures supply chain security by preventing image substitution.
Citations:
- 1: https://docs.astral.sh/uv/guides/integration/docker/
- 2: Docker documentation should cover pinning with sha256 astral-sh/uv#6565
- 3: [docs/integration/docker] add sha pinning tip astral-sh/uv#10955
- 4: https://ghcr.io/astral-sh/uv
- 5: Publish Docker image: ghcr.io/astral-sh/uv astral-sh/uv#2774
Pin the uv image by explicit version tag and SHA256 digest.
The ghcr.io/astral-sh/uv:latest tag is non-reproducible and weakens supply-chain control. Use an explicit version tag combined with SHA256 digest pinning to ensure build reproducibility and prevent image substitution attacks.
Proposed change
```diff
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:<pinned-version>@sha256:<pinned-digest> /uv /uvx /bin/
```
Consult the official uv Docker integration guide for available version tags and instructions to retrieve the corresponding digest using docker manifest inspect.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` at line 34, Replace the non-reproducible image reference used as
a build stage source in the COPY statement (currently
"ghcr.io/astral-sh/uv:latest" in the COPY --from=... line) with an explicit
version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.
```diff
 ## Copy project files
 COPY pyproject.toml .
 COPY pyspectrafuse/ ./pyspectrafuse/

-## Install the package
-RUN pip install --no-cache-dir -e .
+## Install the package with uv
+RUN uv pip install --system --no-cache .
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Checking declared readme in pyproject.toml:"
rg -n '^\s*readme\s*=' pyproject.toml
echo
echo "Checking copied files in Dockerfile:"
rg -n '^\s*COPY\s+' Dockerfile
```
Repository: bigbio/pyspectrafuse
Length of output: 291
Add missing README.md to Docker build context before install.
When uv pip install . runs from the local directory, it reads pyproject.toml which declares readme = "README.md" at line 9. The build will fail because README.md is not copied into the Docker image before the install command executes. Add the missing file between the pyproject.toml and pyspectrafuse/ copies:
Fix
```diff
 ## Copy project files
 COPY pyproject.toml .
+COPY README.md .
 COPY pyspectrafuse/ ./pyspectrafuse/

 ## Install the package with uv
 RUN uv pip install --system --no-cache .
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
## Copy project files
COPY pyproject.toml .
COPY README.md .
COPY pyspectrafuse/ ./pyspectrafuse/

## Install the package with uv
RUN uv pip install --system --no-cache .
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` around lines 39 - 44, The Docker build fails because
pyproject.toml declares readme = "README.md" but README.md is not copied before
running RUN uv pip install --system --no-cache . Update the Dockerfile so that
README.md is copied into the image before the install step (i.e., place a COPY
README.md . between the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines)
so the package installation can read the declared README during build.
```python
columns = ['mz_array', 'intensity_array', 'charge',
           'observed_mz', 'calculated_mz', 'rt', 'scan',
           'run_file_name', 'peptidoform']
```
Persist the parquet RT instead of hard-coding 0.0.
rt is loaded here, but every Spectrum record is still written with retentionTime=0. Any MaRaCluster mode that consumes RT from the .dat payload will see all spectra at time zero.
Proposed fix
```diff
     runs = df['run_file_name'].values if 'run_file_name' in df.columns else [''] * len(df)
     peps = df['peptidoform'].values if 'peptidoform' in df.columns else [''] * len(df)
     scans = df['scan'].values if 'scan' in df.columns else [0] * len(df)
+    rts = df['rt'].values if 'rt' in df.columns else np.zeros(len(df), dtype=np.float32)
     ...
+    rt = rts[i]
+    rt = 0.0 if rt is None or np.isnan(rt) else float(rt)
     dat_buf.extend(_pack_spectrum(
-        file_idx, scannr, charge, prec_mz, 0.0, peak_bins))
+        file_idx, scannr, charge, prec_mz, rt, peak_bins))
```
Also applies to: 226-227
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 147 - 149, The code defines a
columns list including 'rt' but still writes Spectrum records with
retentionTime=0.0; update the Spectrum/dict construction to use the parsed rt
value instead of the hard-coded 0.0 (e.g., retentionTime=rt or
retentionTime=row['rt']), and make the same change at the other spot flagged
(the second Spectrum/dict creation). Ensure the 'rt' value is cast to float if
necessary before assigning to retentionTime so the parquet persists the actual
retention time.
```python
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)

if len(mz_arr) == 0 or len(int_arr) == 0:
    skipped += 1
    continue

charge = int(charges[i])
if charge <= 0:
    skipped += 1
    continue

# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
    prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)

# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)

# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
```
Validate fragment array lengths before binning.
If a parquet row has different-length mz_array and intensity_array, _bin_spectrum() can index past the end of mz_array and kill the whole conversion instead of skipping one bad spectrum.
Proposed fix
```diff
-        if len(mz_arr) == 0 or len(int_arr) == 0:
+        if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
             skipped += 1
             continue
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)
if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
    skipped += 1
    continue
charge = int(charges[i])
if charge <= 0:
    skipped += 1
    continue
# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
    prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)
# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)
# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 196 - 218, The code converts
mz_array and intensity_array to numpy arrays then calls _bin_spectrum but
doesn't verify they have the same length; when they differ _bin_spectrum may
index out-of-bounds. Before computing charge/prec_mz/etc., add a validation step
comparing lengths of mz_arr and int_arr (and that both lengths > 0) and skip the
spectrum (increment skipped) if they differ or are zero-length; reference the
variables mz_arr, int_arr and the called function _bin_spectrum when adding this
guard so only well-formed arrays reach _bin_spectrum.
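The effect of the guard can be demonstrated in isolation; this predicate is a sketch of the proposed check, not the module's code:

```python
import numpy as np

def arrays_well_formed(mz_arr, int_arr):
    """True only when both fragment arrays are non-empty and equal-length,
    the precondition _bin_spectrum needs to index them in lockstep."""
    mz_arr = np.asarray(mz_arr, dtype=np.float64)
    int_arr = np.asarray(int_arr, dtype=np.float64)
    return len(mz_arr) > 0 and len(mz_arr) == len(int_arr)
```

A mismatched pair fails the predicate and the spectrum is counted as skipped instead of crashing the whole conversion.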
| """Convert all .psm.parquet files in a directory to .dat format. | ||
|
|
||
| Args: | ||
| parquet_dir: Directory containing .psm.parquet files | ||
| output_dir: Directory to write .dat files | ||
| file_idx_start: Starting file index | ||
| charge_filter: If set, only include spectra with this charge | ||
|
|
||
| Returns: | ||
| Tuple of (list of .dat file paths, total spectra written) | ||
| """ | ||
| from pyspectrafuse.commands.spectrum2msp import find_target_ext_files | ||
|
|
||
| parquet_files = find_target_ext_files(parquet_dir, '.parquet') | ||
| if not parquet_files: | ||
| raise FileNotFoundError(f"No parquet files found in {parquet_dir}") |
Restrict dataset conversion to the spectrum parquet inputs.
The docstring says this path converts .psm.parquet files, but find_target_ext_files(parquet_dir, '.parquet') will also pick up outputs like cluster_metadata.parquet and psm_cluster_membership.parquet. Those files do not have the spectrum arrays this converter expects, so this silently generates useless .dat artifacts for non-spectrum inputs.
Proposed fix
```diff
-    parquet_files = find_target_ext_files(parquet_dir, '.parquet')
+    parquet_files = find_target_ext_files(parquet_dir, '.psm.parquet')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyspectrafuse/maracluster_dat.py` around lines 264 - 279, The converter is
currently grabbing any ".parquet" files (via find_target_ext_files) and produces
.dat outputs for non-spectrum files; restrict it to spectrum inputs by changing
the finder/filter to only match ".psm.parquet" (e.g., call
find_target_ext_files(parquet_dir, '.psm.parquet') or post-filter parquet_files
to keep only names ending with '.psm.parquet'), and keep the FileNotFoundError
when the filtered list is empty; reference the existing parquet_files variable
and the find_target_ext_files call so the filter change is localized and
prevents producing .dat for cluster metadata or membership files.
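The post-filter alternative mentioned above can be expressed as a one-liner; this helper is illustrative only, not the project's code:

```python
def keep_spectrum_parquets(paths):
    """Keep only *.psm.parquet spectrum inputs, dropping derived outputs
    such as cluster_metadata.parquet and psm_cluster_membership.parquet."""
    return [p for p in paths if str(p).endswith('.psm.parquet')]
```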
````markdown
## Project Structure

Run the test suite:
```
````
Add a language identifier to the fenced block.
Line 162 uses a fenced code block without a language, which triggers markdownlint MD040.
Proposed fix
````diff
-```
+```text
 pyspectrafuse/
 ├── maracluster_dat.py    # Parquet to .dat binary conversion
 ...
 └── mgf_convert/          # MGF conversion utilities
````

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 162-162: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` at line 162, The fenced code block at the README's project tree is
missing a language identifier (triggering markdownlint MD040); update the
opening fence from ``` to ```text (or another appropriate language like ```bash)
so the block becomes labeled (e.g., change the block around the pyspectrafuse
tree to start with ```text) and leave the closing ``` as-is to satisfy the
linter and improve rendering.
Summary
- maracluster_dat.py: parquet-to-dat conversion bypassing MGF entirely (15x smaller output)
- build-cluster-db and parquet2dat CLI commands
- merge-clusters with BIN strategy support
- QPX metadata module (qpx_metadata.py) and MSNet-to-QPX converter

Details
The dat bypass writes MaRaCluster's internal binary format directly from parquet. Binning replicates BinSpectra::binBinaryTruncated: bin = floor(mz / 1.000508 + 0.32), top-40 peaks, skip peaks >= precursor neutral mass. Size: ~100 bytes/spectrum vs ~1.5 KB in MGF.

The build-cluster-db command creates cluster_metadata.parquet + psm_cluster_membership.parquet from MaRaCluster output + scan_titles mapping, designed for the dat-bypass workflow.

Test plan
- python test_e2e_single_dataset.py on PXD014877
- pyspectrafuse convert-dat produces valid .dat files
- pyspectrafuse build-cluster-db end-to-end
- --method_type bin

🤖 Generated with Claude Code
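The binning rule described above can be sketched in a few lines. This is an approximation for illustration, reusing the stated constants; it is not the code in maracluster_dat.py:

```python
import numpy as np

def bin_spectrum(mz_arr, int_arr, prec_mass, top_n=40):
    """Replicate the described rule: drop peaks at or above the precursor
    neutral mass, keep the top-N peaks by intensity, then bin with
    bin = floor(mz / 1.000508 + 0.32)."""
    mz_arr = np.asarray(mz_arr, dtype=np.float64)
    int_arr = np.asarray(int_arr, dtype=np.float64)
    keep = mz_arr < prec_mass                  # skip peaks >= neutral mass
    mz_arr, int_arr = mz_arr[keep], int_arr[keep]
    if len(mz_arr) > top_n:
        top = np.argsort(int_arr)[-top_n:]     # indices of the N most intense peaks
        mz_arr = mz_arr[top]
    bins = np.floor(mz_arr / 1.000508 + 0.32).astype(np.int32)
    return np.unique(bins)                     # sorted, deduplicated bin indices
```

Storing only a handful of small integer bins per spectrum is what makes the .dat payload (~100 bytes/spectrum) so much smaller than MGF text.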
Summary by CodeRabbit
New Features
Refactor
Documentation
Tests